Have you ever read horror stories as a child? Have you ever been scared by horror fiction? Whether your answer is yes or no, let us analyze horror fiction itself. We will analyze texts by Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft with the goal of finding out whether they are appropriate for children to read.
2.1 Load Libraries
We will use a range of libraries for general data wrangling and visualisation, together with more specialised language-processing tools.
packages.used <- c("ggplot2", "dplyr", "tibble", "tidyr", "stringr", "tidytext", "topicmodels", "wordcloud2", "wordcloud", "ggridges","corrplot")
# check packages that need to be installed.
packages.needed <- setdiff(packages.used, intersect(installed.packages()[,1], packages.used))
# install additional packages
if(length(packages.needed) > 0) {
install.packages(packages.needed, dependencies = TRUE, repos = 'http://cran.us.r-project.org')
}
library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(wordcloud2)
library(wordcloud)
library(ggridges)
library(corrplot)
source("../libs/multiplot.R")
2.2 Load Data
spooky <- read.csv('../data/spooky.csv', as.is = TRUE)
head(spooky)
## id
## 1 id26305
## 2 id17569
## 3 id11008
## 4 id27763
## 5 id12958
## 6 id22965
## text
## 1 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
## 2 It never once occurred to me that the fumbling might be a mere mistake.
## 3 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.
## 4 How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.
## 5 Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.
## 6 A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.
## author
## 1 EAP
## 2 HPL
## 3 EAP
## 4 MWS
## 5 HPL
## 6 MWS
summary(spooky)
## id text author
## Length:19579 Length:19579 Length:19579
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
View(spooky)
We found that each sentence is assigned an ID, so we checked whether the ID and text of each row are unique:
length(unique(spooky$id)) == nrow(spooky)
## [1] TRUE
length(unique(spooky$text)) == nrow(spooky)
## [1] TRUE
Are there any missing values?
sum(is.na(spooky))
## [1] 0
Convert the author name into a factor variable:
spooky$author <- as.factor(spooky$author)
4.1 Tokenization
Using tidy data principles is a powerful way to make data handling easier and more effective, and this is no less true when it comes to text. We therefore first use the unnest_tokens() function to drop all punctuation and transform all words into lower case.
# Make a table with one word per row and remove `stop words` (i.e. the common words).
spooky_wrd <- unnest_tokens(spooky, word, text)
spooky_wrd <- anti_join(spooky_wrd, stop_words, by = "word")
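As a minimal illustration of what unnest_tokens() does (a toy sentence introduced here, not part of the dataset):

```r
library(tibble)
library(tidytext)

# Hypothetical toy sentence to show the tokenisation behaviour:
# words are lower-cased and punctuation is stripped automatically.
toy <- tibble(id = 1, text = "The Raven, perched UPON a bust!")
unnest_tokens(toy, word, text)
# One row per word: "the", "raven", "perched", "upon", "a", "bust"
```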
5.1 Wordcloud
Now we use the package "wordcloud2" to generate word clouds visualising each author's work. These plots only give an overview of an author's work; in the remaining parts of this chapter we will use other methods to examine the word frequency in each author's work more closely.
5.1.1 Wordcloud for EAP
spooky_wrd_EAP <- spooky_wrd%>%
dplyr::count(author,word)%>%
dplyr::group_by(author)%>%
tidyr::spread(author, n)%>%
dplyr::select(word, EAP) %>%
dplyr::filter(!is.na(EAP))%>%
dplyr::arrange(desc(EAP))
figPath = system.file("examples/octopus.jpg",package = "wordcloud2")
wordcloud2(spooky_wrd_EAP, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")
5.1.2 Wordcloud for MWS
spooky_wrd_MWS <- spooky_wrd%>%
dplyr::count(author,word)%>%
dplyr::group_by(author)%>%
tidyr::spread(author, n)%>%
dplyr::select(word, MWS) %>%
dplyr::filter(!is.na(MWS))%>%
dplyr::arrange(desc(MWS))
figPath = system.file("examples/white-ghost-hi.jpg",package = "wordcloud2")
wordcloud2(spooky_wrd_MWS, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")
5.1.3 Wordcloud for HPL
spooky_wrd_HPL <- spooky_wrd%>%
dplyr::count(author,word)%>%
dplyr::group_by(author)%>%
tidyr::spread(author, n)%>%
dplyr::select(word, HPL) %>%
dplyr::filter(!is.na(HPL))%>%
dplyr::arrange(desc(HPL))
figPath = system.file("examples/black-bird-hi.png",package = "wordcloud2")
wordcloud2(spooky_wrd_HPL, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")
5.2 Most frequent words
In this part we take a closer look at the word frequencies in each author's work and find out which words each author used most often.
# Counts number of times each author used each word.
author_words <- count(group_by(spooky_wrd, word, author))
# Counts number of times each word was used.
all_words <- rename(count(group_by(spooky_wrd, word)), all = n)
author_words <- left_join(author_words, all_words, by = "word")
author_words <- arrange(author_words, desc(all))
author_words <- ungroup(head(author_words, 81))
ggplot(author_words) +
geom_col(aes(reorder(word, all, FUN = min), n, fill = author)) +
xlab(NULL) +
coord_flip() +
facet_wrap(~ author) +
theme(legend.position = "none")
We can already see that some words, such as "time", are almost equally frequent across all authors. In contrast, "love" is clearly used more by Shelley than by Lovecraft, and "half" appears at a notable frequency only in Poe and Lovecraft, not in Shelley's work.
5.3 Author-dependent word frequencies
Let's plot the word frequencies (log scale), comparing two authors at a time, to see how the words distribute on the plane. Words close to the line (y = x) have similar frequencies in both sets of texts, while words far from the line are found much more often in one set of texts than in the other.
As the plots below show, some words lie close to the line, but most are scattered around it, indicating differences between the authors' frequencies.
5.3.1 EAP vs. HPL
#We need to spread the author (key) and the word frequency (value) across multiple columns. There may be NAs if a word was not used by an author
tib.word_freqs_author <- spooky_wrd %>%
dplyr::count(author, word) %>%
dplyr::group_by(author) %>%
dplyr::mutate(word_freq_author = n/sum(n)) %>%
dplyr::select(-n)
tib.word_freqs_author <- tib.word_freqs_author%>%
tidyr::spread(author, word_freq_author)
#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
author_words_EAP_vs_HPL <- tib.word_freqs_author %>%
dplyr::select(word, EAP, HPL) %>%
dplyr::filter(!is.na(EAP) & !is.na(HPL))
ggplot(data = author_words_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
geom_abline(color = "red", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
theme(legend.position = "none") +
labs(y = "HP Lovecraft", x = "Edgar Allan Poe")
5.3.2 EAP vs. MWS
#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
author_words_EAP_vs_MWS <- tib.word_freqs_author %>%
dplyr::select(word, EAP, MWS) %>%
dplyr::filter(!is.na(EAP) & !is.na(MWS))
ggplot(data = author_words_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
geom_abline(color = "red", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
theme(legend.position = "none") +
labs(y = "Mary Wollstonecraft Shelley", x = "Edgar Allan Poe")
5.3.3 HPL vs. MWS
author_words_HPL_vs_MWS <- tib.word_freqs_author %>%
dplyr::select(word, HPL, MWS) %>%
dplyr::filter(!is.na(HPL) & !is.na(MWS))
ggplot(data = author_words_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
geom_abline(color = "red", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
theme(legend.position = "none") +
labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")
5.4 Correlation – Spearman
Next, we compute the Spearman rank correlation between the authors' word frequencies (as the code below does with method = "spearman"). In this way we can quantify how similar or different these sets of word frequencies are by author.
cor.tib.word_freqs_author <- tib.word_freqs_author[2:4]%>%
cor(use="complete.obs", method="spearman") %>%
corrplot(type="lower",
method="pie",
diag = F)
There is a correlation of around 0.48 to 0.5 between the different authors.
5.5 TF-IDF
TF stands for term frequency: how often a word appears in a text, which is what we studied above in the word clouds. IDF stands for inverse document frequency, a way of paying more attention to words that are rare across the entire set of texts; it is more sophisticated than simply removing stop words. Multiplying the two values together gives a term's tf-idf: the frequency of a term adjusted for how rarely it is used. We'll use tf-idf as a heuristic index of how frequently a certain author uses a word relative to how frequently all the authors use it. This lets us find words that are characteristic of a specific author, which is useful if we are interested in the author-identification problem.
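Concretely, with each author's collected sentences treated as one "document" and using the natural logarithm (as tidytext's bind_tf_idf() does):

\(\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{\lvert\{d : t \in d\}\rvert}\)

where \(n_{t,d}\) is the count of term \(t\) in document \(d\) and N = 3 is the number of authors. This matches the printout below: a word used by all three authors has idf = ln(3/3) = 0, so its tf-idf vanishes, while a word unique to one author has idf = ln(3/1) ≈ 1.10.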
frequency <- count(spooky_wrd, author, word)
tf_idf <- bind_tf_idf(frequency, word, author, n)
head(tf_idf)
## # A tibble: 6 x 6
## author word n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 EAP à 9 0.000124 1.10 0.000136
## 2 EAP a.m 3 0.0000412 0.405 0.0000167
## 3 EAP aaem 1 0.0000137 1.10 0.0000151
## 4 EAP ab 1 0.0000137 1.10 0.0000151
## 5 EAP aback 2 0.0000275 1.10 0.0000302
## 6 EAP abandon 7 0.0000961 0 0
tail(tf_idf)
## # A tibble: 6 x 6
## author word n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 MWS youth's 1 0.0000160 0.405 0.00000649
## 2 MWS youthful 10 0.000160 0 0
## 3 MWS youths 2 0.0000320 0.405 0.0000130
## 4 MWS zaimi 2 0.0000320 1.10 0.0000352
## 5 MWS zeal 7 0.000112 0 0
## 6 MWS zest 3 0.0000480 0 0
tf_idf <- arrange(tf_idf, desc(tf_idf))
tf_idf <- mutate(tf_idf, word = factor(word, levels = rev(unique(word))))
# Grab the top thirty tf_idf scores in all the words
tf_idf_30 <- top_n(tf_idf, 30, tf_idf)
ggplot(tf_idf_30) +
geom_col(aes(word, tf_idf, fill = author)) +
labs(x = NULL, y = "TF-IDF values") +
theme(legend.position = "top", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
Note that many of the words picked out by their tf-idf scores are names. This makes sense: if we see text referencing Raymond, Idris, or Perdita, we can be almost certain that MWS is the author. But some non-names stand out. EAP often uses "monsieur" and "jupiter", while HPL uses "bearded" and "attic" more frequently than the others. We can also look at the most characteristic terms per author.
# Grab the top twenty tf_idf scores in all the words for each author
tf_idf <- ungroup(top_n(group_by(tf_idf, author), 20, tf_idf))
ggplot(tf_idf) +
geom_col(aes(word, tf_idf, fill = author)) +
labs(x = NULL, y = "tf-idf") +
theme(legend.position = "none") +
facet_wrap(~ author, ncol = 3, scales = "free") +
coord_flip() +
labs(y = "TF-IDF values")
6. Data Visualization
We'll compute some simple numerical summaries of the data and present them in a few visualizations.
p1 <- ggplot(spooky) +
geom_bar(aes(author, fill = author)) +
theme(legend.position = "none")
spooky$sen_length <- str_length(spooky$text)
head(spooky$sen_length)
## [1] 231 71 200 206 174 468
p2 <- ggplot(spooky) +
geom_density_ridges(aes(sen_length, author, fill = author)) +
scale_x_log10() +
theme(legend.position = "none") +
labs(x = "Sentence length [# characters]")
spooky_wrd$word_length <- str_length(spooky_wrd$word)
head(spooky_wrd$word_length)
## [1] 7 8 5 12 10 7
p3 <- ggplot(spooky_wrd) +
geom_density(aes(word_length, fill = author), bw = 0.05, alpha = 0.3) +
scale_x_log10() +
theme(legend.position = "none") +
labs(x = "Word length [# characters]")
layout <- matrix(c(1, 2, 1, 3), 2, 2, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)
## Loading required package: grid
## Picking joint bandwidth of 0.0414
Part II Further analysis – are they appropriate for children to read?
The Motion Picture Association of America (MPAA) film rating system is used in the United States and its territories to rate a film’s suitability for certain audiences based on its content. The MPAA rating system is one of various motion picture rating systems that are used to help parents decide what films are appropriate for their children.
We would like to apply the idea of movie rating to book classification to help parents and children choose age-appropriate books.
When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. In this part, we will do some basic sentiment analysis.
1.1 Use NRC lexicon
get_sentiments('nrc')
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
sentiments <- inner_join(spooky_wrd, get_sentiments('nrc'), by = "word")
count(sentiments, sentiment)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 9869
## 2 anticipation 11258
## 3 disgust 7697
## 4 fear 13927
## 5 joy 10190
## 6 negative 23674
## 7 positive 25175
## 8 sadness 12674
## 9 surprise 6199
## 10 trust 13655
count(sentiments, author, sentiment)
## # A tibble: 30 x 3
## author sentiment n
## <fct> <chr> <int>
## 1 EAP anger 2962
## 2 EAP anticipation 3982
## 3 EAP disgust 2261
## 4 EAP fear 4194
## 5 EAP joy 3302
## 6 EAP negative 7659
## 7 EAP positive 9291
## 8 EAP sadness 3938
## 9 EAP surprise 2244
## 10 EAP trust 5116
## # ... with 20 more rows
ggplot(count(sentiments, sentiment)) +
geom_col(aes(sentiment, n, fill = sentiment))
ggplot(count(sentiments, author, sentiment)) +
geom_col(aes(sentiment, n, fill = sentiment)) +
facet_wrap(~ author) +
coord_flip() +
theme(legend.position = "none")
From the plot above, we see that there is not much positive mood in HPL's works. MWS used positive words almost as often as negative ones; there is plenty of "trust" and "joy" to counteract all the "sadness", "fear", and "anger".
Next, we will separate the positive and negative words of each author to see more detail.
sentiments_EAP <- sentiments%>%
filter(author == "EAP" & (sentiment == "positive" | sentiment == "negative")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_EAP,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("Edgar Allan Poe - Sentiment analysis")
It is interesting that his most-used negative words are "words" and "doubt", pointing to a different kind of horror: one of the imagination. His most-used positive word is "found". This made me much more interested in his works; I can't help wondering what I would find in them.
sentiments_HPL <- sentiments%>%
filter(author == "HPL" & (sentiment == "positive" | sentiment == "negative")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_HPL,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("HP Lovecraft - Sentiment analysis")
There is not much positive mood in HPL's works. As with EAP, his most-used positive word is "found". We also found more colour-related words than in the other two authors: HPL uses "black" to represent something negative, while "white" and "green" are used positively. In my opinion, "white" may connote cleanliness and honesty, while "green" is probably a colour full of hope and love.
sentiments_MWS <- sentiments%>%
filter(author == "MWS" & (sentiment == "positive" | sentiment == "negative")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_MWS,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("Mary Shelley - Sentiment analysis")
We found that Mary Shelley used positive words almost as much as negative ones. She used “love” a lot as a positive word. “Death” is clearly her most common negative word.
We have a question now. Will the result be different if we use other lexicons?
Let’s change from “nrc” to “bing” to get more information about sentiment analysis.
1.2 Use Bing Lexicon
get_sentiments('bing')
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
sentiments_bing <- inner_join(spooky_wrd, get_sentiments('bing'), by = "word")
count(sentiments_bing, sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 22919
## 2 positive 13551
count(sentiments_bing, author, sentiment)
## # A tibble: 6 x 3
## author sentiment n
## <fct> <chr> <int>
## 1 EAP negative 7188
## 2 EAP positive 4844
## 3 HPL negative 7589
## 4 HPL positive 2706
## 5 MWS negative 8142
## 6 MWS positive 6001
ggplot(count(sentiments_bing, sentiment)) +
geom_col(aes(sentiment, n, fill = sentiment))
ggplot(count(sentiments_bing, author, sentiment)) +
geom_col(aes(sentiment, n, fill = sentiment)) +
facet_wrap(~ author) +
coord_flip() +
theme(legend.position = "none")
After applying the "bing" lexicon, we found that, overall, all three authors used more negative words than positive ones.
Next, we again separate the positive and negative words of each author to see more detail.
sentiments_bing_EAP <- sentiments_bing%>%
filter(author == "EAP") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_bing_EAP,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("Edgar Allan Poe - Sentiment analysis")
"doubt" and "death" are now the most-used negative words in EAP's works, while "word", the most-used negative word in the NRC lexicon analysis, no longer appears. The words on the positive side also differ from the NRC analysis.
sentiments_bing_HPL <- sentiments_bing%>%
filter(author == "HPL") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_bing_HPL,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("HP Lovecraft - Sentiment analysis")
There is not much positive mood in HPL's works. Both his most-used negative word and his most-used positive word differ from the NRC lexicon analysis. Here his most-used negative word is "strange", which seems somewhat more fitting than "ancient" in the NRC result.
sentiments_bing_MWS <- sentiments_bing%>%
filter(author == "MWS") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
group_by(sentiments_bing_MWS,sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative/positive sentiment", x = NULL) +
coord_flip() +
ggtitle("Mary Shelley - Sentiment analysis")
From the plot above, we can see that she used "love" a lot as a positive word, and "death" is clearly her most common negative word. To my surprise, the most-used negative and positive words are the same as in the NRC lexicon analysis. She used positive words almost as often as negative ones.
After comparing the results from the different lexicons, I decided to use the "bing" lexicon for further analysis. Next, we will study only "negative" words, looking for clues towards a theory of book classification.
bing_neg <- filter(get_sentiments('bing'), sentiment == "negative")
bing_neg
## # A tibble: 4,782 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 abnormal negative
## 4 abolish negative
## 5 abominable negative
## 6 abominably negative
## 7 abominate negative
## 8 abomination negative
## 9 abort negative
## 10 aborted negative
## # ... with 4,772 more rows
negative <- inner_join(spooky_wrd, bing_neg, by = "word")
head(negative)
## id author word word_length sentiment
## 1 id26305 EAP dungeon 7 negative
## 2 id17569 HPL mistake 7 negative
## 3 id11008 EAP incessantly 11 negative
## 4 id12958 HPL perplexed 9 negative
## 5 id12958 HPL steals 6 negative
## 6 id22965 MWS intense 7 negative
count(negative, word, sort = TRUE)
## # A tibble: 2,403 x 2
## word n
## <chr> <int>
## 1 death 380
## 2 strange 283
## 3 fear 240
## 4 dark 223
## 5 dead 203
## 6 lost 181
## 7 object 165
## 8 doubt 159
## 9 fell 157
## 10 wild 157
## # ... with 2,393 more rows
First, we plot a frequency comparison of these “negative” words.
neg_words <- count(group_by(negative, word, author))
neg_words_all <- count(group_by(negative, word))
neg_words <- left_join(neg_words, neg_words_all, by = "word")
neg_words <- arrange(neg_words, desc(n.y))
neg_words <- ungroup(head(neg_words, 81))
ggplot(neg_words) +
geom_col(aes(reorder(word, n.y, FUN = min), n.x, fill = author)) +
xlab(NULL) +
coord_flip() +
facet_wrap(~ author) +
theme(legend.position = "none")
Then we go a step further and assign a "negativity fraction" to each sentence, defined as # negative / (# negative + # positive). We plot the distribution of these negativity indices for the three authors:
p1 <- ggplot(sentiments_bing, aes(author, fill = sentiment)) + geom_bar(position = "fill")
p2 <- group_by(sentiments_bing, author, id, sentiment) %>%
count() %>%
spread(sentiment, n, fill = 0) %>%
group_by(author, id) %>%
summarise(neg = sum(negative),
pos = sum(positive)) %>%
arrange(id) %>%
mutate(frac_neg = neg/(neg + pos)) %>%
ggplot(aes(frac_neg, fill = author)) +
geom_density(bw = .2, alpha = 0.3) +
theme(legend.position = "right") +
labs(x = "Fraction of negative words per sentence")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)
HPL's works are more negative than those of EAP and MWS. The plots above also show that, for fractions of negative words per sentence between 0 and 0.5, the density for MWS is larger than for EAP and HPL.
The Motion Picture Association of America (MPAA) film rating system considers five components: violence, language, substance abuse, nudity, and sexual content. Since the last four are harder to set a standard of comparison for, here we analyze only the first one: violence.
We first built a dictionary containing common violent words, drawn from http://www.thesaurus.com (for example: crazy, cruel, fierce).
Then we counted how often these words appear in the entire dataset and in each author's works.
#build a dictionary which contains all the common violence words.
Vio <- data.frame(word=c('brutal','crazy','cruel','fierce','homicidal','hysterical','murderous','passionate','potent','powerful','savage','uncontrollable','vicious','agitated','aroused','berserk','bloodthirsty','coercive','demoniac','desperate','distraught','disturbed','enraged','fiery','forceful','forcible','frantic','fuming','furious','great','headstrong','hotheaded','impassioned','impetuous','inflamed','intemperate','mad','maddened','maniacal','mighty','raging','riotous','rough','strong','ungovernable','unrestrained','urgent','vehement','wild'),stringsAsFactors=FALSE)
head(Vio)
## word
## 1 brutal
## 2 crazy
## 3 cruel
## 4 fierce
## 5 homicidal
## 6 hysterical
#count these words appeared in spooky data
vio_words <- inner_join(Vio, spooky_wrd, by = "word")
lapply(vio_words, function(x)which(is.na(x)))
## $word
## integer(0)
##
## $id
## integer(0)
##
## $author
## integer(0)
##
## $word_length
## integer(0)
vio_words <- count(group_by(vio_words, word,author))
head(vio_words)
## # A tibble: 6 x 3
## # Groups: word, author [6]
## word author n
## <chr> <fct> <int>
## 1 agitated EAP 11
## 2 agitated HPL 1
## 3 agitated MWS 14
## 4 aroused EAP 11
## 5 aroused HPL 15
## 6 aroused MWS 1
ggplot(vio_words) +
geom_col(aes(word,n,fill = word))
ggplot(vio_words) +
geom_col(aes(word, n, fill = word)) +
facet_wrap(~ author) +
coord_flip() +
theme(legend.position = "none")
We found that these violent terms are indeed distributed throughout all three authors' works. For lack of further information, we cannot yet set a standard for judging whether there is too much violent mood in their works. Perhaps in the future we will have a baseline; for example, when violent terms make up about 20% or more of the entire text, children would need to read such books under the guidance of their parents.
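As a sketch of how such a baseline could be operationalised (the 20% threshold is purely an illustrative placeholder, and vio_frac is a name introduced here, not part of the analysis above):

```r
# Share of each author's (non-stop-word) tokens that fall in the Vio
# dictionary; flag authors exceeding a hypothetical 20% threshold.
vio_frac <- spooky_wrd %>%
  dplyr::group_by(author) %>%
  dplyr::summarise(frac = mean(word %in% Vio$word))
dplyr::mutate(vio_frac, needs_guidance = frac >= 0.20)
```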
Topic modeling is a method for the unsupervised classification of documents into themes, similar to clustering on numeric data. We will look through the content of each author's works by running Latent Dirichlet Allocation (LDA), again with an eye to whether they are appropriate for children to read.
# Counts how many times each word appears in each sentence
sent_wrd_freqs <- count(spooky_wrd, id, word)
head(sent_wrd_freqs)
## # A tibble: 6 x 3
## id word n
## <chr> <chr> <int>
## 1 id00001 content 1
## 2 id00001 idris 1
## 3 id00001 mine 1
## 4 id00001 resolve 1
## 5 id00002 accursed 1
## 6 id00002 city 1
# Creates a DTM matrix
spooky_wrd_tm <- cast_dtm(sent_wrd_freqs, id, word, n)
spooky_wrd_tm
## <<DocumentTermMatrix (documents: 19467, terms: 24941)>>
## Non-/sparse entries: 193944/485332503
## Sparsity : 100%
## Maximal term length: 19
## Weighting : term frequency (tf)
length(unique(spooky_wrd$id))
## [1] 19467
length(unique(spooky_wrd$word))
## [1] 24941
The matrix spooky_wrd_tm is a sparse matrix with 19467 rows, corresponding to the 19467 ids (originally, sentences) in the spooky_wrd dataframe, and 24941 columns, corresponding to the total number of unique words in spooky_wrd. Each row of spooky_wrd_tm therefore corresponds to one of the original sentences. The value of the matrix at a given position is the number of occurrences of that word (the column) in that specific sentence (the row). Since most sentence/word pairings do not occur, the matrix is sparse, meaning most entries are zero.
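The sparsity figure in the printout can be reproduced from those counts (a quick arithmetic check using the numbers shown above):

```r
# total cells = documents x terms; the printed "Sparsity: 100%" is the
# rounded share of zero cells.
total_cells <- 19467 * 24941   # 485,526,447 = 193944 + 485332503
nonzero     <- 193944          # non-/sparse entries from the printout
1 - nonzero / total_cells      # ~0.9996, which rounds to 100%
```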
For LDA we must pick the number of topics. I tried 12, 10, 8, and 6, and finally chose 6.
spooky_wrd_lda <- LDA(spooky_wrd_tm, k = 6, control = list(seed = 1234))
spooky_wrd_topics <- tidy(spooky_wrd_lda, matrix = "beta")
spooky_wrd_topics
## # A tibble: 149,646 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 content 0.000198
## 2 2 content 0.000316
## 3 3 content 0.000359
## 4 4 content 0.0000223
## 5 5 content 0.0000294
## 6 6 content 0.000107
## 7 1 idris 0.000499
## 8 2 idris 0.000562
## 9 3 idris 0.000467
## 10 4 idris 0.000406
## # ... with 149,636 more rows
We note that in the above we use the tidy function to extract the per-topic-per-word probabilities, called “beta” or \(\beta\), for the model. The final output has a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic.
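As a quick sanity check (not part of the original pipeline): since each topic's betas form a probability distribution over terms, they should sum to approximately one per topic:

```r
spooky_wrd_topics %>%
  dplyr::group_by(topic) %>%
  dplyr::summarise(total_beta = sum(beta))
# Each of the 6 rows should show total_beta close to 1
```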
# Grab the top five words for each topic.
spooky_wrd_topics_5 <- ungroup(top_n(group_by(spooky_wrd_topics, topic), 5, beta))
spooky_wrd_topics_5 <- arrange(spooky_wrd_topics_5, topic, -beta)
spooky_wrd_topics_5 <- mutate(spooky_wrd_topics_5, term = reorder(term, beta))
ggplot(spooky_wrd_topics_5) +
geom_col(aes(term, beta, fill = factor(topic)), show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
In the above, we found that several words appear in many topics. For instance, topics 1, 4, 5, and 6 all contain the word "time", and "found" appears in topics 2, 3, and 6. At this point, I begin to understand why "found" can feel like a negative term: when it appears in a topic together with words for parts of the body, such as "eyes", "head", and "heart", it really makes people feel creepy. There are also words describing time, such as "night", "dark", and "time" itself, which made me feel even more frightened.
We also see that these 6 topics are quite similar; I can hardly tell the difference between them. The only thing I can be sure of is that these words remind me of all the horror stories I might think of in my life.
Therefore, let's study the terms with the greatest difference in probability between topics, ignoring words that are shared with similar frequency between topics. We choose only the first 3 topics as examples and visualise the differences by plotting log ratios: \(log_{10}(\beta \text{ of topic x }/ \beta \text{ of topic y})\). If a word is 10 times more frequent in topic x, the log ratio will be 1, whereas it will be -1 if the word is 10 times more frequent in topic y.
spooky_wrd_topics <- mutate(spooky_wrd_topics, topic = paste0("topic", topic))
spooky_wrd_topics <- spread(spooky_wrd_topics, topic, beta)
spooky_wrd_topics_12 <- filter(spooky_wrd_topics, topic1 > .001 | topic2 > .001)
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, log_ratio = log10(topic2 / topic1))
spooky_wrd_topics_12 <- group_by(spooky_wrd_topics_12, direction = log_ratio > 0)
spooky_wrd_topics_12 <- ungroup(top_n(spooky_wrd_topics_12, 5, abs(log_ratio)))
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, term = reorder(term, log_ratio))
p1 <- ggplot(spooky_wrd_topics_12) +
geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
theme(legend.position = "none") +
labs(y = "Log ratio of beta in topic 2 / topic 1") +
coord_flip()
spooky_wrd_topics_23 <- filter(spooky_wrd_topics, topic2 > .001 | topic3 > .001)
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, log_ratio = log10(topic3 / topic2))
spooky_wrd_topics_23 <- group_by(spooky_wrd_topics_23, direction = log_ratio > 0)
spooky_wrd_topics_23 <- ungroup(top_n(spooky_wrd_topics_23, 5, abs(log_ratio)))
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, term = reorder(term, log_ratio))
p2 <- ggplot(spooky_wrd_topics_23) +
geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
theme(legend.position = "none") +
labs(y = "Log ratio of beta in topic 3 / topic 2") +
coord_flip()
spooky_wrd_topics_13 <- filter(spooky_wrd_topics, topic3 > .001 | topic1 > .001)
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, log_ratio = log10(topic3 / topic1))
spooky_wrd_topics_13 <- group_by(spooky_wrd_topics_13, direction = log_ratio > 0)
spooky_wrd_topics_13 <- ungroup(top_n(spooky_wrd_topics_13, 5, abs(log_ratio)))
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, term = reorder(term, log_ratio))
p3 <- ggplot(spooky_wrd_topics_13) +
geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
theme(legend.position = "none") +
labs(y = "Log ratio of beta in topic 3 / topic 1") +
coord_flip()
layout <- matrix(c(1,2,3), 3, 1, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)
From the above, I guess topic 1 may take place outdoors, since it features words like "moon", "sky", and "air" more prominently than the other two topics. Topic 2 may take place at a late-night party. The story may also be set in a study full of books, all standing together on the shelves.
Think further
Because children lack mature judgment, the film classification system has largely shielded them from violence, drugs, and other content likely to have adverse effects on them. The literary world needs comparable hierarchies to help children grow up healthily. In this project, I did not find a way to set a standard for judging whether there is too much violent mood in these works. I would like to explore this area further, and I hope that in the future we will have a baseline; for example, when violent terms make up about 20% or more of the entire text, children would need to read such books under the guidance of their parents.